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Abstract. We introduce a novel online Bayesian method for the identification of a family 
of noisy recurrent neural networks (RNNs). We develop Bayesian active learning tech- 
nique in order to optimize the interrogating stimuli given past experiences. In particular, 
we consider the unknown parameters as stochastic variables and use the D-optimality 
principle, also known as Hnfomax method', to choose optimal stimuli. We apply a greedy 
technique to maximize the information gain concerning network parameters at each time 
step. We also derive the D-optimal estimation of the additive noise that perturbs the dy- 
namical system of the RNN. Our analytical results are approximation-free. The analytic 
derivation gives rise to attractive quadratic update rules. 

1 Introduction 

When studying online systems it is of high relevance to facilitate fast information gain con- 
cerning the system |l|2j . As an example, consider the research on real neurons. In one of the 
experimental paradigms, researchers look for the stimulus that maximizes the response of the 
neuron [314] . Another approach searches for stimulus distribution that maximizes mutual infor- 
mation between stimulus and response [5j. A recent technique assumes that the unknown system 
belongs to the family of generalized linear models [6] and treats the parameters as probabilis- 
tic variables. Then the goal is to find the optimal stimuli by maximizing mutual information 
between the parameter set and the response of the system. 

We are interested in the active learning |7I8I9|1Q| of noisy recurrent artificial neural networks 
(RNNs), when we have the freedom to interrogate the network and to measure the responses. 
Our framework is similar to the generalized linear model (GLM) approach used by [6]: we 
would like to choose interrogating, or 'control' inputs in order to (i) identify the parameters 
of the network and (ii) estimate the additive noise efficiently. From now on, we use the terms 
control and interrogation interchangeably; control is the conventional expression, whereas the 
word interrogation expresses our aims better. We apply online Bayesian learning |ll|12H3fT4] 
to accomplish our task. For Bayesian methods prior updates often lead to intractable posterior 
distributions such as a mixture of exponentially numerous distributions. Here, we show that 
in our model computations are both tractable and approximation-free. Further, the emerging 
learning rules are simple. We also show that different stimuli are needed for the same RNN 



Present address: Department of Computing Science, University of Alberta, Athabasca Hall, Edmon- 
ton, Canada, T6G 2E8 



model depending on whether the goal is to estimate the weights of the RNN or the additive 
noise that perturbs the RNN. Hereafter we will refer to this noise as the 'driving noise' of the 



Our approach, which optimizes control online in order to gain maximum information con- 
cerning the parameters, falls into the realm of Optimal Experimental Design, or Optimal 
Bayesian Design |15I1I16|17I18| . Several optimality principles have been worked out and their 
efficiencies have been studied extensively in the literature. For a review see, e.g., [19]. Our ap- 
proach corresponds to the so-called D-optimality |20|2JLj , which is equivalent to the information 
maximization (infomax) principle applied by [6]. We use both terms, the term D-optimality and 
the term infomax, to designate our approach. To the best of our knowledge, D-optimality has 
not been applied to the typical non-spiking stochastic artificial recurrent neural network model 
that we treat here. 

The contribution of this paper can be summarized as follows: We use the D-optimality 
(infomax) principle and derive cost functions and algorithms for (i) the parameter learning of 
the stochastic RNN and (ii) the estimation of its driving noise. We show that, (iii) using the D- 
optimality interrogation technique, these two tasks are not compatible with each other: greedy 
control signals derived from the D-optimality principle for parameter estimation are suboptimal 
(basically the worst possible) for the estimation of the driving noise and vice versa. We show that 
(iv) D-optimal cost functions lead to simple greedy optimization rules both for the parameter 
estimation and for the noise estimation, respectively. Investigation of non-greedy multiple step 
optimizations, which may achieve more efficient estimation of the network parameters and the 
noise, seems difficult and is beyond the scope of the present paper. However, (v) for the task of 
estimating the driving noise we introduce a non-greedy multiple step look- ahead heuristics. 

The paper is structured as follows: In Section [2] we introduce our model. Section [3] concerns 
the Bayesian equations of the RNN model. Section|4]derives the optimal control for its parameter 
identification starting from the D-optimality (infomax) principle. Section[5]deals with our second 
task, when the goal is the estimation of the driving noise of the RNN. The paper ends with a 
short discussion and some conclusions (Section [6]). 



We introduce our model here. Let -P(e) = A/" e (m, V) denote the probability density of a normally 
distributed stochastic variable e with mean m and covariance matrix V. Let us assume that 
we have d simple computational units called 'neurons' in a recurrent neural network: 



where {e t }, the driving noise of the RNN, denotes temporally independent and identically 
distributed (i.i.d.) stochastic variables and P(e t ) — W et (0, V), r t G R d represents the observed 
activities of the neurons at time t. Let u t £ R c denote the control signal at time t. The neural 
network is formed by the weighted delays represented by matrices F^ (i = 0, . . . , /) and Bj 
(j = 0, . . . , J), which connect neurons to each other and also the control components to the 
neurons, respectively. Control can also be seen as the means of interrogation, or the stimulus 
to the network [6]. We assume that function g : R d — > R d in ([TJ) is known and invertible. The 



RNN. 



2 The Model 




(1) 



computational units, the neurons, sum up weighted previous neural activities as well as weighted 
control inputs. These sums are then passed through identical non-linearities according to Eq. (U). 
Our goal is to estimate the parameters F l E R dxd (i = 0, . . . , I), B 3 E R dxc (j = 0, . . . , J) and 
the covariance matrix V, as well as the driving noise e t by means of the control signals. 

In artificial neural network terms, ([I]) is in the form of rate code models. In our rate code 
model, noise, control, and the recurrent activities influence the firing rates similarly. We show 
that analytic cost functions emerge for this model that are free of approximation. 



3 Bayesian Approach 

Here we embed the estimation task into the Bayesian framework. First, we in- 
troduce the following notations: x t+i = [r t _j; . . . ;r t ; u t -j+i; ■ ■ ■ ; u t+ i], y t +i = g _1 (r t+ i), 
A = [F/, . . . , F , Bj, . . . , Bo] 6 ffi dxm . With these notations, model (HJ reduces to a linear equa- 
tion 

y t =Ax i +e i . (2) 

To fulfill our goal, the online estimation of the unknown quantities (parameter matrix A, noise 
e t and its covariance matrix V), we rely on Bayes' method. We assume that prior knowledge is 
available and we update our posteriori knowledge on the basis of the observations. Control will 
be chosen at each instant to provide maximal expected information concerning the quantities we 
have to estimate. Starting from an arbitrary prior distribution of the parameters the posterior 
distribution needs to be computed. This can be highly complex, however, so approximations are 
common in the literature. For example, assumed density filtering, when the computed posterior 
is projected to simpler distributions, has been suggested |22|23|li] . We shall use the method of 
conjugated priors [24] instead. For matrix A we assume a matrix valued normal distribution 
prior. For covariance matrix V inverted Wishart (IW) distribution will be our prior. One can 
show for these choices that the functional form of the posteriori distributions is not affected. 

We define the normally distributed matrix valued stochastic variable A E K dxm by us ing 
the following quantities: M E fl^xm j s the expected value of A. V G M. dxd is the covariance 
matrix of the rows, and K S ]R" lxm is the so-called precision parameter matrix that we shall 
modify in accordance with the Bayesian update. They are both positive semi-definite matrices. 
The density function of the stochastic variable A is defined as: 

A/a(M, V, K) = |2 ^ m 2 /2 exp(-i*r((A - MfV^A - M)K)), 

where tr, | ■ |, and superscript T denote the trace operation, the determinant, and transposition, 
respectively. See e.g. |25I26| . We assume that Q S M. dxd is a positive definite matrix and n > 0. 
Using these notations, the density of the Inverted Wishart distribution with parameters Q and 
n is as follows \25\: 



VK d+1 )/ 2 
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exp (-±tr(V- 1 Q)) ) 
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d 

where Z 7U d = n^^ 1 ^ 4 Yl r((n + 1 — «)/2) and r(.) denotes the gamma function. 

i=l 



Now, one can rewrite model ^ as follows: 



P(A|V)=JVa(M,V,K), (3) 

PCV)=2Wv(Q,»), (4) 
P( ei |V)=A4 t (0,V), (5) 
P(y t |A )Xt ,V)=AT yt (Ax t) V). (6) 

4 The Infomax Approach for Parameter Learning 

Let us compute the parameter estimation strategy for task (TJ (i.e., for task l(3|)-((6])) as 
prescribed by the infomax principle. Let us introduce two shorthands; = {A,V}, and 
{x}^ = {xj, . . . , Xj}. We choose the control value in ([1]) at each instant such that it provides 
the most expected information concerning the unknown parameters. Assuming that {x}* , {y}\ 
are given, according to the infomax principle our goal is to compute 

argmax J(0,y t+1 ; {x}*/ 1 , {y}*), (7) 

u t+ i 

where I(a, 6; c) denotes the mutual information of stochastic variables a and b for fixed param- 
eters c. Let H(a\b; c) denote the conditional entropy of variable a conditioned on variable b and 
for fixed parameter c. Note that 

1(6, y t+1 ; {x}*+\ {y}\) = H(0; {x}< + \ {y}\) - H(0\y t+1] {x}\ + \ {y}\), 

holds [27j and H (9; {x}* +1 , {y}*) = H(8; {x}|, {y}*) is independent from u t +i, hence our task 
is reduced to the evaluation of the following quantity: 

argminF(0| yt+1 ;{x} 4 1 +1 ,{y} t 1 ) = (8) 

U t +1 

= argmin-y dy t+1 P(y t+1 \{ X }\ +1 , {y}*) J ddP(G\{x}\ + \ {y}* +1 ) logP(e|{x}*+\ {y}*+^ 



In order to solve this minimization problem we need to evaluate P(yt+i |{x}^ +1 , {y}*), the 
posterior P(0|{x}* +1 , {y}* +1 ), and the entropy of the 

posterior, that is J dOP(8\{x} f 1 +1 ,{y} t + 1 )logP(8\{x} f 1 +1 ,{y} t 1 +1 ), where P{a\b) denotes the 
conditional probability of variable a given condition b. The main steps of these computations 
are provided below. 

Assume that the a priori distributions P(A|V, {y}\ ) = A/"(A|M t , V, K t ) and 
P(V|{x}*, {y}*) = IWv(Qt,fit) are known. Then the posterior distribution of 6 is: 

P , A v ,r x t +1 S ,t+u _ P(y t+ i|A,V,x t+1 )P(A|V,{x} t 1 ,{y} t 1 )P(V|{x} t 1 ,{y}«) 
P(A,V|{x }l ,{y }l )- P(y t+1 |{x}^,{y}i) 

= M yt+1 (Ax t+ i, V)jV A (M t ,V, K t )JWy(Q t , n t ) 
Ja fvK t+ i (Ax t+ i, V)M A (M U V, K f )IWv(Q t , n t )) ' 

This expression can be rewritten in a more useful form: let K £ R mxm and Q £ M dxd be 
positive definite matrices. Let A £ R dxm , and let us introduce the density function of the 



matrix valued Student-t distribution |28|26| as follows: 

|K| d / 2 Z n+m . d IQI"/ 2 



T A (Q,n,M,K) 



7r*"/2 z n4 |Q + (A-M)K(A-M) T |(™+«)/ 2 ' 
Now, we need the following lemma: 
Lemma 41 

A/y(Ax,V)A/k(M,V,K)JW v (Q,n) = A/" A ((MK + yx T )(xx T + K)-\V,xx T + K) x 
xIWv (Q + (y - Mx) (1 - x T (xx T + K) _1 x) (y - Mx) T ,n+lj x 

x7y (Q,n,Mx, 1 - x T (xx T + K)~ 1 x) . 

Proof. It is easy to show that the following equations hold: 

A/y (Ax, V)A/a (M, V, K) = A/a (M+ , V, xx T + K)7V y (Mx, V, 7), 
AA A (M,V,K)XW v (Q,n) = JW V (Q + H, n + 1) T A (Q, n, M, K), (9) 

where M+ = (MK + yx T )(xx T + K) -1 , 7 = 1 — x T (xx T + K) _1 x, H = (A — M)K(A — M) T 
for the sake of brevity. Then we have 

A/y(Mx, V, 7)2W V (Q, n) = XW V (Q + (y - Mx) 7 (y - Mx) T , n + l)T y (Q, n, Mx, 7) , 

and the statement of the lemma follows. 

Using this lemma, we can compute the posterior probabilities. Let us introduce the following 
quantities: 

7t+i = 1 - xf +1 (x t+ ixf +1 + K t ) -1 x t+ i, 
nt+x =n t + l, 
M t+ i = (M t K t +y i+1 xf +1 )(x t+1 xf +1 +K0" 1 , 

Qt+i = Qt + (yt+i - M t x t+ i) 7t+i (y t+x - Mtx f+1 ) T . (10) 
For the posterior probabilities we have determined that 

P(A|V,{x}i +1 ,{y}i +1 ) =AOv(M t+1 ,V,x t+1 xf +1 +K t ), (11) 



P(V|{x}* +1 ,{y} t + 1 ) = XW v (Q i+ i,n t+1 ), (12) 

yt+i 



P(y t+1 1 {x}* +1 , {y}i ) = T yt+1 (Q t , n t , M t x t+1 , 7t+1 ) . 



Having done so, we can compute the entropy of the posterior distribution of 6 = {A, V} by 
means of the following lemma: 

Lemma 42 The entropy of a stochastic variable with density function P(A, V) = 
Af A (M, V, K)IW V (Q,n) assumes the form - f In |K| + (m±*H) In |Q]+/i {d, n) , where h (d, n) 
depends only on d and n. 



Proof. Let vec(A) denote a vector of dm dimensions where the (d(i — 1) + 1) , . . . , (id) th 
(1 < i < m) elements of this vector are equal to the elements of the i th column of matrix 
A 6 Rrfx™ m ^he appropriate order. Let ® denote the Kronecker-product. It is known that 
forP(A) =7V A (M,V,K), P(vec(A)) = M vec{A) {vec(M), V ® K" 1 ) holds [26]. Using the well- 
known formula for the entropy of a multivariate and normally distributed variable [27] and 
applying the relation |V ® K _1 | = |V| m /|K| d , we have that 



H(A;V) = -lnjVtgjK^ 1 ! 



dm , , . to , . 
— ln(27re) = -ln|V| 



■lnlKI 



dm 



ln(27re) 



Exploit certain properties of the Wishart distribution, we compute the entropy of distribution 
IWv(Q,n). The density of the Wishart distribution is defined by 



1 



Wv(Q,n) = ^|V|("- d - 1 )/ 2 



Q 1 



n/2 



,-xp ( -itr-CVQ" 1 ; 



Let W denote the digamma function, and let f2(d, n) = — J2i=i^( 2 ' )) — din 2. Replacing 
V" 1 with S, we have for the Jacobian that |f|| = \^3zr\ = |S|~ (d+1) [25]. To proceed we use 



that i?w s (Q,n)S = nQ, and Bw s (Q,n) m |S| = In |Q| — f2(d,n), [29] and substitute them into 
Exw v (Q,n) ln|V|, and ^iw v (Q,™)* r (Q v_1 ) : 



1 



1 



Z n ,d |V|^+D/2 
1 



V X Q 



'/' 2 



^XWvCQ.n) m l V l - 
• M» I -itrOT^)) ln|V|rfV 



SQ 



'/2 



f 1 \S\(™-d-D/2 
J Zn,d 



l/2 



exp ( --tr(SQ) ) In |S| |S| _ d_ 1 rfS 



exp --ir(SQ) InjSjdS 



ln|Q|+/ 2 (d,n) 



(13) 



One can also show that 



1 



1 



Z n , d |V|(<i+l)/2 
1 



V !Q 



|g|(d+l)/2 



2 

SQ 



n/2 



^IWv(Q,n)^(QV- 1 ) - 

exp ( _Jt r (V- 1 Q) ] tr(Q\- x )dV 



1/2 



f ^_|g|(«-d-l)/2 
J Z„M 



exp ( -^ir(SQ) ] *r(QS)|Sr <i-1 dS > 



i/2 



exp ( --tr(SQ) 1 ir(QS)dS, 

= £; Ws(Q- 1 ,«)^(QS), 

= £r(QQ _1 n) = nd 



(14) 



We calculate the entropy of stochastic variable V with distribution IWv(Q, n). It follows from 
Eq. JTl and Eq. {H} that 



-E- 



JW v (Q,n) 



- \n{Zn :d ) + — In 



ff(V) = 

— 2 l n |V|--tr(V- 1 Q) 



\n{Z n ^ d ) - - In 



n + d+ 1 



ln|Q|-^«?(^±^))- ,/h,2 



2 

d+1 



ln|Q| + / 3 (d,n), 



where fs(d,n) depends only on d and n. 

Given the results above, we complete the computation of entropy H(A, V) as follows: 

H (A, V) = H(A\ V) + H(V) = H(V) + J dVZW v (Q, n)H(A; V) 

= J rfVXW v (Q, n) In |V| -|ln|K| + ^ ln(27re)) + H(V) 

- In |K| + ^ ln(27r e ) + y [In |Q| + f 2 (d, n)} + In |Q| + / 3 (d, n) 

= -^ln|K| + ( m+ 2 rf+1 )ln|Q| + /l (d,n). 

This is exactly what was claimed in Lemma l42l 

Lemmas |4l] and H21 lead to the following: 

Corrolary 43 For the entropy of a stochastic variable with posterior distribution -P(A, V|x, y) 
it holds that 



ff(A,V;x,y) 



dn|xx T + K| + fi(d,n) 



,m + d+l 



2 , , v-v 7 / v 2 

We note that the following lemma also holds: 
Lemma 44 



)ln|Q + (y-Mx) 7 (y-Mx) 2 



/ T y (Q, n, /x, 7 ) In |Q + (y - /x) 7 (y - A*) T |dy 



«'s independent from both \i and 7, 

and thus we can compute the conditional entropy expressed in ([8]): 
Lemma 45 

J7(A,V|y;x) - J p (y|x) H(A, V; x, y)dy = ~ In |xx T + K| + 5l (Q, d, n). 
where gi(Q,d,n) depends only on Q, d and n. 



Collecting all the terms, we arrive at the following intriguingly simple expression 

u t+1 o Pt =argmin f p (y t+1 |{x}'+\ {y}*) H(A, V|{x}'+\ {y}\, yt+x)dy t+ i 
"t+i J 

= argmin--ln|x t+ ixl_ 1 +K t | = argmaxxf +1 K t _1 x t+ i, (15) 

u t+ i I u t+ i 

where 

xt+i = [rt-i; • • • ; r f ; ut-j+x; ■ ■ ■ ; ut+i], 

and we used that |xx T + K| = |K|(1 + x T K _1 x) according to the Matrix Determinant Lemma 
[30] . We assume a bounded domain U for the control, which is necessary to keep the max- 
imization procedure of lfl5|) finite. This is, however, a reasonable condition for all practical 
applications. So, 

\i t+ x»pt = argmaxxL_ 1 Kj" 1 Xt+i, (16) 

In what follows D-optimal control will be referred to as 'infomax interrogation scheme'. The 
steps of our algorithm are summarized in Table [TJ 



Table 1: Pseudocode of the algorithm 



Control Calculation 

ut+i = argmaxufflX^iK^xt+i 

where x t+ i = [r t _j; . . . ; r t ;u t _j+i; . . . ;u t ;u] 

set x t+ i = [r t -r, . . . ;r t ; ut-j+i; . . . ; u*; ut+i] 
Observation 

observe r t+ i, and let y t +i = g~ 1 (r t +i) 
Bayesian update 

M t+ i = (M t K t + y t +ixf + i)(x t+ ixf +1 + K t ) _1 

K t+ i = x t+ ixf +1 + K 4 

nt+i = n t + 1 

7t+i = 1 - xf +1 (x t+ ixf +1 + K l )" 1 x i+ i 

Qt+i = Qt + (yt+i — Mtxt+i) yt+i (yt+i — MjX t +i) T 



Computation of the inverse (xt+ix^ +K t ) 1 in Table[T]can be simplified considerably by 
the following recursion: let Pt = K^ 1 , then according to the Sherman-Morrison formula [31] 

P t+1 = (x t+1 xf +1 + K.r 1 = P, - ?:*7i +lPt ■ (17) 

1 + Xt+l-rtxt+l 



In this expression matrix inversion disappears and instead only a real number is inverted. 



5 Estimating the Noise 



One might wish to compute the optimal control for estimating noise e t in (Q]), instead of the 
identification problem above. Based on |TJ and because 

/ J 

et+i = yt+i -^F,-r t _i -^Bj-u t+ i_j, (18) 

i=0 j=0 

one might think that the best strategy is to use the optimal infomax control of Table [TJ since 
it provides good estimations for parameters A = [F/, . . . , Fo, Bj, . . . , Bo] and so for noise e t . 

Another — and different — thought is the following. At time t, let us denote our estimations 
as et, F' (i=0,. . . ,1), and B* (j=0,, . . ,J). Using HH]), we have that 

i J 
e*+i - et+i = £)(F< - Fj)r t _i + ^(B, - B*)u t+1 _,-. (19) 

i=0 j=0 

This hints that the control should be u t = for all times in order to get rid of the error 
contribution of matrix B j in lfT9|) . 

Straightforward utilization of D-optimality considerations, opposed to the objective of ([7]), 
suggests the optimization of the following quantity: 

argmax/(e t+ i, y t+i ; {x}* +1 , {y}*), 

u t +i 

That is, for the estimation of the noise we want to design a control signal Ut+i such that the 
next output is the best from the point of view of greedy optimization of mutual information 
between the next output yt+i and the noise e t +\. It is easy to show that this task is equivalent 
to the following optimization problem: 

argmin f dy t+1 P(y t+1 |{x}* +1 , {y}\)H (e t+1 ; {x}* +1 , {y}^ 1 ), (20) 

Ut+l J 

where H(e t+ i;{x} t 1 +1 ,{y}\ +1 ) = ff(Ax i+1 ;{x} f 1 +1 , {y}*i +1 ), because e t+ i = y t +i - Ax t+ i, To 
compute this quantity we need the following lemma |26| : 

Lemma 51 If P(A) = M A (M, V, K), then P(Ax) = A^ Ax (Mx,V, (x T K- 1 x) _1 ) 
Applying this lemma and using l[TTj) one has that 

P(Ax t+1 |V,{x}*+ 1 ,{y}* 1 ) = A/A Xt+1 (M t+1 x t+l! V, (x^iK^xt+i)" 1 ) (21) 
We introduce the notations 

K t+1 - (xf +1 K- + 1 1 x t+1 )" 1 e K, (22) 
At+i - 1 + (Ax t+ i - M t+ ixt+i) T {Kt+iQt+i)(Axt+i - M f+ ix t+ i) g R 



and use |(9]) and (fl2| for the posterior distribution lj2lj) . Then we arrive at 
P(Ax t+1 |{x}* +1 ,{y}*) = T Axt+1 (Q t+ i,7i t+1 ,M t+1 x t+1 ,X t+1 



= n-^\K- + \d t+1 \-^ 



a; 



The Shannon-entropy of this distribution according to [32] equals: 

if(Ax t+1 ; {x}^ 1 , {y}^ 1 ) = h(d,n t+1 ) + | log \K^\ + log |Q t+1 | 



where 

f 4 (d, n t+ i) = - log 



n f+ i+l-d 



7r d/2 jr ( 



n f+ i + 1 — d 



Using the notations introduced in fTO)) and in (|22|) the above expressions can be transcribed as 
follows: 



P(Ax t+1 ;{x} t + i ,{y} t + i ) = h{d,n t+ i) - - log |Jf t+1 | + log |Q t+1 | 

= f A {d,n t+1 ) + ^log|xf +1 (K f +x t+ ixf +1 )" 1 x t+ i| + log|Q i+ i| 

= fi{d,n t+l ) + ^log|xf +1 (K t +x t+ ixf +1 )" 1 x t+ i| + 
+ log |Q t + (y i+ i - M t x t+ i) 7 t+ i (y t+ i - M t x t+ i) T | 
Now, we are in a position to calculate lf20j) by applying Lemma l44l as before. We get that 

J dy t+1 P(y t+1 |{x}* +1 ,{y} f 1 ) J ff(e i+1 ;{xr+ 1 ! {y} t + 1 ) = 

= 92(Qt,d,n t+1 ) + ^log|xf +1 (K t +x t+1 xf +1 )~ 1 x f+1 |, 

where <?2(Qt, d, n t +i) depends only on Qt, d and n t +i. Thus, we have that 

argmax/(e t+ i,y t+ i;{x}* +1 ,{y}* 1 ) = argminlog |xf +1 (K t + x t+ ixf +1 ) _1 x t+ i| 



Ut + l 



Uf+1 



= arg min log 

u t+ i 



H+l 



K 



= arg min log 

u t+ i 



1 + x m K t lx t+i 

xf +1 K t -1 x t+1 



x t+ i 



1 + x f+i K t lx *+i 
argminxLiK^^t+i 



U t +1 



In practice, we perform this optimization in an appropriate domain U. Thus, the D-optimal 
interrogation scheme for noise estimation is as follows 



opt ■ T 

u f +i = argrnmx t+1 K t Xt+1 . 



(23) 



It is worth noting that this D-optimal cost function for noise estimation and the D-optimal 
cost function derived for parameter estimation in (fl5j) are not compatible with each other. 
Estimating one of them quickly will necessarily delay the estimation of the other. 

In Section 15.11 we see that for large enough t values, expression (J23j) gives rise to control 
values close to u f = 0. 

5.1 Greedy and Non-Greedy Optimization 

Greedy optimization of (f23]) is simple, provided that K t is fixed during the optimization of 
u t+1 . If so, then the optimization task is quadratic. To see this, let us partition matrix K f as 
follows: 

where K t n G ffi rfxd ,Kf e R™-^, R 22 g R m-dxm-d_ It ig eagy to gee that if domain U in 
is large enough then 

u& = (Kf )" 1 K? 1 r t . (24) 

However, for non-greedy solutions, expression K t in x^ 1 _ 1 K f T 1 x t+ i changes, because it may 
depend on previous control inputs ui, . . . , u t , the subject of previous optimization steps. The 
optimal strategy for long-term non-greedy optimization falls outside of the scope of the present 
work. Here we propose the following heuristics for this problem: Use the strategy of Table Q] 
for the first r steps. It increases |K t | quickly in f23|) . Then after r-steps switch to the control 
described in lj24j) . This will decrease the cost function (f23|) further. We will call this non-greedy 
interrogation heuristics introduced for noise estimation 'r-infomax noise interrogation'. 

It is worth noting that in the r-infomax noise interrogation, if the r switching time is large 
enough then for large t values |K| 2 1 will be large, and hence — according to ([24| — the optimal u t 
interrogation will be close to 0. The approximation of the 'r-infomax noise interrogation' when 
we use the interrogation described in Table Q] for r steps and then switch to zero-interrogation 
will be called the 'r-zero interrogation' scheme. 



6 Discussion and Conclusions 

We have treated the identification problem of recurrent neural networks described by model ([I]) . 
We applied active learning to solve this task. In particular, the online D-optimality principle 
was applied and we investigated the learning properties for parameter and noise estimations. 
We note that the D-optimal interrogation scheme is also called infomax control in the literature 
[6]. This name originates from the cost function that optimizes the mutual information. 
The GLM model used by [6] is as follows: 

I 1 J \ 

r t +i =g y^Fjr t _j + y]BjU t+ i-j + e t +i, (25) 

\ »=o j=o I 



where {e t } is i.i.d. noise with E R d 



mean. The authors model spiking neurons and assume 



that the main source of the noise is this spiking, which appears at the output of the neurons and 
adds linearly to the neural activity. They investigated the case in which the observed quantity r t 
had a Poisson distribution. Unfortunately, in this model Bayesian equations become intractable 
and the estimation of the posterior may be spoiled, because the distribution is projected to the 
family of normal distributions at each instant. A serious problem with this approach is that the 
extent of the information loss caused by this approximation is not known. Our stochastic RNN 
model 



differs only slightly from the GLM model of (|25l) . but it has considerable advantages, as we 
shall discuss below. 

Bayesian designs of different kinds were derived for the linear regression problem in [35J: 



This problem is similar to ours (([I])-©), but while the goal of [35] was to find an optimal 
design for the explanatory variables 8, we were concerned with the parameter (X in ([261 ) and 
the noise (e) estimation task. In Verdinelli's paper inverted gamma prior and vector-valued 
normal distribution were assumed on the isotropic noise and on the explanatory variables, 
respectively. By contrast, we were interested in the matrix-valued coefficients and in general, 
non-isotropic noises. We used matrix- valued normal distribution for the coefficients and inverted 
Wishart distribution for the covariance matrix as conjugate priors. Due to the inverted Wishart 
distribution that we used, the covariance matrix of the noise is not restricted to the isotropic 
form, but can be general in our case. 

The Bayesian online learning framework allowed us to derive analytic results for the greedy 
optimization of the parameters as well as the driving noise. Optimal interrogation strategies 
(fl6|) and 11231) appeared in attractive, intriguingly simple quadratic forms. We have shown that 
these two tasks are incompatible with each other. Parameter and noise estimations require the 
maximization and the minimization of expression xJ +1 H^ 1 nt+i, respectively. 

The problem of non-greedy optimization of the full task has been left open. However, we 
put forth a heuristic solution for the estimation of the driving noise that we called r-infomax 
noise interrogation. It uses the D-optimal interrogation of Table Q] up to r-steps, and applies 
the noise estimation control of (J23J) afterwards. This heuristics decreases the estimation error of 
the coefficients of matrices F and B up to time r and thus — upon turning off the explorative 
D-optimization — tries to minimize the estimation error of the value of the noise at time r + 1. 
We introduced the r-zero interrogation scheme and showed that it is a good approximation of 
the r-infomax noise scheme for large r values. 

Finally, it seems desirable to determine the conditions under which the algorithms derived 
from the D-optimal (infomax) principle are both consistent and efficient. The tractable form of 
our approximation-free results is promising in this respect. 




P(e) 



y = X0 + e 
= A/;(0,a 2 I). 



(26) 
(27) 
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